July 6, 2019

Preview

  1. Introduction to single-cell RNA-seq
  2. Quality control and normalization
  3. Survey of downstream analysis methodology

Highly variable genes

  • Goal: identify a set of genes with high variability attributable to biology (over and above technical variability)
  • Useful for visualization, dimension reduction, clustering, marker gene selection, etc.
  • Challenging without an orthogonal measurement of technical variability (e.g. spike-ins)
    • with spike-ins: select genes with variance significantly above the mean-variance trend in control genes
    • without spike-ins: select genes with variance significantly above the overall mean-variance trend in all endogenous genes (assumes variance of most genes is purely technical)
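
The no-spike-in strategy above can be sketched in a few lines. This is a minimal illustration, not the method of any particular package: it assumes log-normalized expression values, approximates the mean-variance trend with a running median over genes sorted by mean, and uses a fixed excess-variance cutoff in place of a formal significance test. The function name and the `window` and `min_excess` parameters are hypothetical.

```python
from statistics import mean, median, variance

def highly_variable_genes(expr, window=3, min_excess=1.0):
    """Select genes whose variance sits well above a running-median trend.

    expr: dict mapping gene name -> list of log-normalized values across cells.
    window: number of neighboring genes (by mean) on each side used for the trend.
    min_excess: hypothetical cutoff on (variance - trend); a real analysis
                would use a statistical test instead of a fixed cutoff.
    """
    # Per-gene (mean, variance), sorted by mean expression
    stats = sorted((mean(v), variance(v), g) for g, v in expr.items())
    selected = []
    for i, (_, var, gene) in enumerate(stats):
        lo = max(0, i - window)
        hi = min(len(stats), i + window + 1)
        # Trend = median variance of genes with similar mean; this relies on
        # the assumption that most genes show only technical variance
        trend = median(s[1] for s in stats[lo:hi])
        if var - trend > min_excess:
            selected.append(gene)
    return selected
```

The median is used for the trend so that a few truly variable genes in a window do not pull the trend upward.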

Highly variable genes with spike-ins

Highly variable genes without spike-ins

Review: Dimension reduction

  • Useful to summarize & visualize relationships between cells in low dimensional space
  • Commonly used approaches:
    • PCA (Principal Components Analysis)
    • tSNE (t-distributed Stochastic Neighbor Embedding)
    • UMAP (Uniform Manifold Approximation and Projection)
  • Clustering can be carried out on reduced dimensions, but with caution

Review: PCA vs Nonlinear methods

  • PCA attempts to extract the largest components of variation in the data
  • Nonlinear methods such as tSNE and UMAP attempt to map points to a global coordinate system that preserves local structure
    • density & distance of points not preserved
    • better at visualizing rare subtypes

image source: http://carbonandsilicon.net/rblogging/2018/02/27/UMAP_plots
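
To make "largest components of variation" concrete, the sketch below finds the first principal component by power iteration on the sample covariance matrix. It is an illustration under simplifying assumptions (small dense data, only the leading component, a fixed starting vector that is not orthogonal to the leading component); the function name is hypothetical, and real analyses would use an optimized PCA routine.

```python
import math

def first_principal_component(X, n_iter=200):
    """Return (unit direction, per-cell scores) for the first PC.

    X: list of cells, each a list of feature values (e.g. expression of
       highly variable genes).
    """
    n, p = len(X), len(X[0])
    # Center each feature
    means = [sum(row[j] for row in X) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in X]
    # Sample covariance matrix (p x p)
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(p)]
           for a in range(p)]
    # Arbitrary starting vector, normalized to unit length
    v = [float(j + 1) for j in range(p)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(n_iter):
        # Multiply by the covariance matrix and renormalize
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # no variance in the data
            break
        v = [x / norm for x in w]
    # Project centered cells onto the direction of largest variance
    scores = [sum(r[j] * v[j] for j in range(p)) for r in centered]
    return v, scores
```

Because the projection is linear, distances along the component remain interpretable, in contrast to tSNE and UMAP embeddings.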

General purpose clustering

  • Hierarchical: build and cut dendrogram based on pairwise distance matrix
  • K-means: iteratively assign cells to nearest cluster center
  • Density-based: clusters are defined as areas of higher density (e.g. DBSCAN)
  • Graph-based: assumes data can be represented as a graph structure; doesn’t require estimating pairwise distance matrix (e.g. SNN-Cliq)
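
As an illustration of the K-means bullet above, here is a minimal sketch of the standard assign-then-update loop (Lloyd's algorithm). It assumes cells are given as coordinate tuples (for example, in reduced dimensions) and that initial centers are supplied by the caller; the function name is hypothetical.

```python
import math

def kmeans(points, centers, n_iter=100):
    """Cluster points by iteratively assigning each to its nearest center.

    points:  list of coordinate tuples (one per cell).
    centers: list of initial center tuples; its length is K.
    Returns (labels, centers).
    """
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(n_iter):
        # Assignment step: index of the nearest center for every point
        labels = [min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
                  for p in points]
        # Update step: move each center to the mean of its assigned points
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for p, l in zip(points, labels) if l == k]
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(old)  # keep an empty cluster's center in place
        if new_centers == centers:  # converged: centers stopped moving
            break
        centers = new_centers
    return labels, centers
```

The result depends on the initial centers and on K, which is one reason different runs and algorithms can disagree.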


Clustering for single-cell

  • Goal: automatically identify subpopulations of cell types/states
  • Many methods adapted/developed for single-cell to account for:
    • high dropout rate
    • batch effects
    • high dimensionality

Andrews & Hemberg 2018 (https://doi.org/10.1016/j.mam.2017.07.002)

Evaluation of clustering results

  • Many clustering algorithms require specification of the number of clusters (e.g. K in K-means)
  • Different algorithms may provide vastly different clusterings

Risso et al 2018 (https://doi.org/10.1371/journal.pcbi.1006378)

Metrics for evaluating clustering results

  • Single clustering:
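    • Average silhouette: compares how close a cell is to the other cells in its own cluster (average distance within) with how close it is to cells in other clusters (average distance between, typically to the nearest other cluster); values near 1 indicate a well-separated cell, negative values a likely misassignment; costly to compute for large numbers of cells \[s(b) = \frac{\bar{d}_{\text{between}}(b) - \bar{d}_{\text{within}}(b)}{\max(\bar{d}_{\text{within}}(b), \bar{d}_{\text{between}}(b))}\]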
    • Modularity score: difference between the observed number of within-cluster edges and the number expected under the null (random edges)
  • Comparing two clusterings:
    • Adjusted Rand index: similarity between two clusterings based on the number of pairs of cells grouped consistently (together in both or apart in both), adjusted for agreement expected by chance
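
Two of the metrics above can be sketched directly from their definitions. This is an illustration, not a replacement for library implementations. It assumes Euclidean distance, at least two clusters, the common silhouette convention (between minus within, divided by the larger of the two), the nearest other cluster for the "between" distance, and a silhouette of 0 for singleton clusters. Function names are hypothetical.

```python
import math
from collections import Counter

def average_silhouette(points, labels):
    """Mean silhouette width; requires at least two clusters."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    total = 0.0
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            continue  # assumed convention: singleton contributes 0
        # Mean distance to the other cells in the same cluster
        within = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # Mean distance to the nearest other cluster
        between = min(sum(math.dist(p, q) for q in other) / len(other)
                      for k, other in clusters.items() if k != l)
        denom = max(within, between)
        if denom > 0:
            total += (between - within) / denom
    return total / len(points)

def adjusted_rand_index(labels_a, labels_b):
    """Chance-adjusted agreement between two clusterings of the same cells."""
    n = len(labels_a)
    # Pairs of cells placed together in both clusterings
    index = sum(math.comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(math.comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(math.comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / math.comb(n, 2)  # agreement expected by chance
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:  # degenerate case, e.g. one cluster in both
        return 1.0
    return (index - expected) / (max_index - expected)
```

The silhouette needs all pairwise distances, which is why it scales poorly with the number of cells. The adjusted Rand index depends only on the labels and is unaffected by how clusters are named.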

Benchmark of clustering methods for single-cell

Review: Differential expression

  • Traditional differential expression for bulk RNA-seq aims at detecting shifts in mean (fold change)
  • Read counts \(y_{bg}\) typically modeled with a two parameter distribution (e.g. mean \(\mu_{bg}\), dispersion \(\alpha_g\))
  • Recall DESeq2 model given size factors \(s_b\): \[ y_{bg} \sim \mathrm{NB}(\mu_{bg}, \alpha_g)\] \[\mu_{bg}=s_b q_{bg}\] \[ \log_2(q_{bg}) = x_{b.}\beta_g \]
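
A small worked example of the model above: the sketch computes \(\mu_{bg}\) from a size factor, a design row, and coefficients, then evaluates the negative binomial log-probability of a count. It assumes the mean-dispersion parametrization in which the variance is \(\mu + \alpha\mu^2\); the function names and the numbers in the usage note are made up for illustration and are not part of DESeq2's interface.

```python
import math

def nb_mean(size_factor, design_row, beta):
    """mu_bg = s_b * q_bg, with log2(q_bg) = x_b . beta_g."""
    log2_q = sum(x * b for x, b in zip(design_row, beta))
    return size_factor * 2 ** log2_q

def nb_log_pmf(y, mu, alpha):
    """Log-probability of count y under NB with mean mu and dispersion alpha.

    Assumes variance = mu + alpha * mu**2, i.e. size parameter r = 1 / alpha.
    """
    r = 1.0 / alpha
    return (math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
            + r * math.log(r / (r + mu))
            + y * math.log(mu / (r + mu)))
```

For instance, with size factor 1.5, design row (1, 1) (intercept plus treatment indicator), and coefficients (3, 1) on the log2 scale, `nb_mean(1.5, [1, 1], [3.0, 1.0])` gives 1.5 × 2^4 = 24. The second coefficient is the log2 fold change, which is the shift in mean that bulk differential expression tests target.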

Bulk RNA-seq measures averages

Heterogeneity hidden in bulk RNA-seq

Distributions of single-cell read counts

scDD

scDD detects more subtle and complex changes

DE after clustering

R tools for downstream analysis

  • scater: visualization, quality control (Bioconductor)
  • scran: normalization, doublet detection, batch effect correction (Bioconductor)
  • SCnorm: normalization (Bioconductor)
  • sctransform: normalization (CRAN)
  • DropletUtils: removal of empty droplets (Bioconductor)
  • Seurat: normalization (CRAN)


There are many more tools I didn’t mention…

Growing number of computational tools

Curated list of tools from Sean